Abstract

We establish a fundamental relation between three different topics: Bayesian 
model selection, model averaging, and oracle performance. The relatively 
basic property of model selection consistency is shown to be equivalent to 
a seemingly more advanced distributional result, the oracle property. The 
result is very simple and general. There is no restriction on the type of prior 
or likelihood function used, or on the limiting distribution of the oracle posterior.
A number of possible applications are discussed.

1 Introduction

In one of the most highly cited statistical papers (already cited more than
2,500 times), Fan and Li (2001) introduced an “oracle property” for
a frequentist penalization method of model selection: statistical inferences
“work as well as if the correct submodel were known.” Such a property
has not been widely studied in the Bayesian context, with the exception
of a few pioneering works: Ishwaran and Rao (2010), who considered linear
models with spike and slab priors; Hong and Preston (2012), who addressed
post-selection prediction; and Li and Jiang (2014), who considered the Bayesian
generalized method of moments. However, the importance of this problem is
well recognized: model selection and inference after selection belong
to the No. 1 open problem in Bayesian statistics (Jordan 2011).

The current paper reveals a simple and general relation between model 
selection consistency and the oracle property in the Bayesian context. The 
simplicity of this relation has motivated us to write a short paper. Instead 
of focusing on one application and developing the full details, we point out 
a diversity of possible applications with useful key references, leaving their 
complete development to possible future work. We believe that a longer paper,
perhaps detailing applications to a particular example, would obscure
the fundamental importance of the proposed simple and general relation.
Nevertheless, we have included some details and technical proofs in the
Supplementary Materials (SM), which also contain several other theoretical
results of possible interest.

The main messages of this paper and its SM include: 

• In the Bayesian context, the seemingly more complicated oracle property
of the posterior distribution is equivalent to the seemingly simpler
property of posterior consistency of model selection. We do not believe
that this was previously unknown; more likely, such a relation has already
been used implicitly in proofs for various special situations. The
goal of our paper is to point out this relation explicitly and generally, and
to apply it systematically.

• Many previous works have already established model
selection consistency in their specific situations (see the section on possible
applications). Our current results then imply that they have in fact also proved
the oracle property of the posterior distribution in each case, even if
they did not mention this in their papers. With some additional effort
in bounding the model selection error, their results can be used to
establish the stronger oracle property for the posterior mean as defined
in our SM. This opens many possibilities for extending
existing Bayesian asymptotic results to the context of Bayesian model
selection.

• The seemingly simpler property of posterior model selection consistency
is in fact not a strong assumption. There exist general ways to prove
it and to bound the convergence rate, based on a general framework of
quasi-posteriors, which allows any empirical risk to be used as the
(quasi-)log-likelihood.

• The quasi-posterior framework allows a nonstandard limiting distribution
for the oracle posterior, which can be nonnormal or have a nonstandard
convergence rate different from $n^{1/2}$. As shown in the SM,
we can accommodate discontinuous empirical risks with cubic-root
asymptotics, and partial identification in Bayesian moment inequalities.

• Many of the relations that we establish in this paper and its SM are
assumption-free and simple, and therefore very general. They reveal intrinsic
relations among a variety of concepts without being tied to
technical features that are specific to particular statistical models. To
name a few such relations that appear in the main text: Proposition 1
relates the total variation distance from the oracle performance
to the model selection error; Proposition 2 relates model averaging, model
selection, and the oracle model; Proposition 3 relates the model selection
error to the risk function in the quasi-posterior.


2 Bayesian Model Averaging 

Let $\pi$ be any probability measure, which will be taken to be the posterior
probability measure conditional on the observed data. Let $M$ be a random
index which will be taken to index a model, and $M_0$ be a possible value of
$M$, which will be taken as the true model index under which the data are
generated.

For any event A, we are interested in the difference 

$$\pi(A) - \pi(A|M_0),$$

where $\pi(A|M_0)$ is the “oracle” posterior, pretending to have known the true
model $M_0$, whereas $\pi(A) = \sum_M \pi(M)\pi(A|M)$ is the mixed posterior via
model averaging, allowing possibilities of all models as weighted by the model
posteriors $\pi(M)$ given the data.

Let us try to define an “oracle property” as the following (Oi)+(Oii):

(Oi) $1 - \pi(M_0) = o_p(1)$ AND

(Oii) $\sup_{A \in \mathcal{F}} |\pi(A) - \pi(A|M_0)| = o_p(1)$ for any set of events $\mathcal{F}$ that may
be data dependent. [4]

That is, the true model has posterior probability converging to 1 as $n$
increases, AND the limiting behavior of the posterior distribution (for any
parameter) in total variation is equivalent to the “oracle posterior” pretending
to have known the true model. [5] This kind of oracle property is similar in
essence to the frequentist oracle property of Fan and Li (2001) but is more
general.

To fully appreciate the generality of the current definition, note that in
the current general context, we do not require the oracle posterior to
have an asymptotic normal limiting distribution at a rate of square root $n$.
The most attractive aspect of Fan and Li’s oracle property is that the inferential
results “work as well as if the correct submodel were known” (Fan and
Li 2001, Abstract). This aspect is already fully captured in (Oii) and there
is no need to impose restrictions on the nature of the oracle posterior $\pi(\cdot|M_0)$.

Such relaxation allows us to include many examples (in addition to classical
examples with asymptotic normal limits), such as the situation with
a discontinuous posterior distribution which is characterized by cubic root $n$
asymptotics (see, e.g., Jun, Pinkse and Wan 2012), and partially identifiable
posterior distributions with $O(1)$ rate asymptotics (see, e.g., Liao and Jiang
2010). It is also noted that the current framework allows quasi-posterior
distributions (such as a posterior based on a partially misspecified Gaussian
model, or the Bayesian generalized method of moments considered in Li and
Jiang 2014), which may not be constructed from a likelihood function but
nevertheless form a quasi-posterior, so long as it forms a probability measure
$\pi(\cdot)$.

The formulation (Oii), of course, can also be applied to classical situations.
Defining it through convergence in probability of the total variation
difference allows a natural connection to classical Bayesian asymptotic
normality or the Bernstein-von Mises theorem for the oracle posterior, which is
formulated in the same mode of convergence.

4 In fact (Oii) is sufficient to imply (Oi), since we can let the event $A$ be $[M = M_0]$.

5 We here only consider the oracle property of the entire posterior distribution. A
different kind of oracle property for the posterior mean is discussed in the Supplementary
Materials.

The following fundamental equality can be derived: 

Proposition 1.

$$\sup_{A \in \mathcal{F}} |\pi(A) - \pi(A|M_0)| = 1 - \pi(M_0).$$

This proposition reveals a deep relation between three different topics:
model averaging ($\pi$), oracle performance ($\pi(\cdot|M_0)$), and model selection
($\pi(M_0)$). The total variation distance between the model average posterior
and the oracle posterior is exactly equal to the posterior probability of missing
the true model.
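One way to see where this equality comes from is sketched below. The sketch assumes $0 < \pi(M_0) < 1$ and that events may involve the model index $M$; the notation $M_0^c = [M \ne M_0]$ is borrowed from the Supplementary Materials, and the steps are a reconstruction rather than the authors' own proof.

```latex
\begin{align*}
\pi(A) &= \pi(M_0)\,\pi(A|M_0) + \pi(M_0^c)\,\pi(A|M_0^c), \\
\pi(A) - \pi(A|M_0) &= \pi(M_0^c)\,\bigl[\pi(A|M_0^c) - \pi(A|M_0)\bigr], \\
|\pi(A) - \pi(A|M_0)| &\le \pi(M_0^c) = 1 - \pi(M_0).
\end{align*}
```

The bound holds for every event $A$, and it is attained at $A = M_0^c$, since $\pi(M_0^c|M_0^c) = 1$ and $\pi(M_0^c|M_0) = 0$. Under this reading, the supremum equals $1 - \pi(M_0)$ whenever the collection of events contains $[M \ne M_0]$, and is bounded by it otherwise.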

Therefore, if there is model selection consistency

(Oi) $1 - \pi(M_0) = o_p(1)$,

then $\sup_{A \in \mathcal{F}} |\pi(A) - \pi(A|M_0)| = o_p(1)$ for any set of events $\mathcal{F}$ that may
be data dependent. Then in total variation, $\pi(\cdot)$ and $\pi(\cdot|M_0)$ have the same
limiting distributions.

Therefore, on this most general level, we have obtained that model selection
consistency (Oi) is equivalent to this oracle property (Oii) [or (Oi) + (Oii)]:

Theorem 1. Posterior model selection consistency (Oi) is equivalent to the 
posterior oracle property (Oii) for Bayesian model averaging. 

3 Bayesian Model Selection 

Although we have talked about model averaging so far, similar results
also exist for Bayesian model selection. Suppose $\hat{M}$ is any Maximum-A-Posteriori
model choice, so that $\pi(\hat{M}) = \max_M \pi(M)$. We are interested in
the total variation distance between the posterior $\pi(\cdot|\hat{M})$ based on Bayesian
model selection and the oracle posterior based on the true model $M_0$.

Then we have:

Proposition 2. The maximal total variation distance among the three posteriors
$\pi$, $\pi(\cdot|\hat{M})$, and $\pi(\cdot|M_0)$ is at most twice the posterior probability of
missing the true model: $2[1 - \pi(M_0)]$.

Therefore, model selection consistency

(Oi) $1 - \pi(M_0) = o_p(1)$,

also implies

(Oiii)* $\sup_{A \in \mathcal{F}} |\pi(A|\hat{M}) - \pi(A|M_0)| = o_p(1)$ for any set of events $\mathcal{F}$ that
may be data dependent. [6]

Let us define the oracle property for Bayesian model selection to be 

(Oiii) = (Oi) + (Oiii)*,


which includes one statement on consistent model selection, and another 
statement on post selection inference. Then, 

Theorem 2. Posterior model selection consistency (Oi) is equivalent to the 
posterior oracle property (Oiii) for Bayesian model selection. 
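Propositions 1 and 2 can be checked numerically on a small discrete example. The sketch below is only an illustration under assumptions that are not in the paper: a hypothetical joint posterior table `joint[j, k]` over a model index j and a gridded parameter value k, with row 0 playing the role of $M_0$, and all measures viewed on the joint space so that $[M = M_0]$ is itself an event. The helper names `tv` and `check_propositions` are made up for this example.

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two pmfs on the same finite space."""
    return 0.5 * np.abs(p - q).sum()

def check_propositions(joint):
    """joint[j, k] is proportional to pi(M = j, theta = k); row 0 is M_0."""
    joint = joint / joint.sum()
    model_post = joint.sum(axis=1)          # pi(M_j) for each model
    j_hat = int(np.argmax(model_post))      # maximum a posteriori model choice

    def conditional(j):
        cond = np.zeros_like(joint)
        cond[j] = joint[j] / model_post[j]  # pi(. | M_j), supported on row j
        return cond

    oracle, selected = conditional(0), conditional(j_hat)
    return {
        "miss": 1.0 - model_post[0],              # 1 - pi(M_0)
        "tv_avg_oracle": tv(joint, oracle),       # model average vs oracle
        "tv_sel_oracle": tv(selected, oracle),    # model selection vs oracle
        "tv_avg_sel": tv(joint, selected),        # model average vs selection
    }

rng = np.random.default_rng(0)
joint = rng.random((4, 6))
joint[0] *= 20.0   # let the true model dominate, in the spirit of (Oi)
out = check_propositions(joint)
print(out)
```

In this setup `tv_avg_oracle` should agree with `miss` up to rounding (Proposition 1), and the largest of the three distances should not exceed `2 * miss` (Proposition 2), whether or not the selected model coincides with row 0.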

We suspect that the Bayesian model selection consistency (Oi) may be
implicitly stronger than the frequentist version. For example, in the present
context, a frequentist style condition of model selection consistency may be
$\hat{M} = M_0$ in probability, which can be achieved by letting $1 - \pi(M_0) < 0.5$,
instead of asking it to be $o_p(1)$.

We now comment that (Oi) can indeed be established in a wide variety of
contexts, where people have mostly focused on model selection consistency
only, and stopped short of stating its attractive implications for the oracle
properties (Oii) and (Oiii), perhaps subconsciously feeling that these would be
much more difficult problems in the theory of limiting distributions. The following
section lists some such possible applications where (Oi) has already
been established.

6 In this case we could not prove (Oiii)* $\Rightarrow$ (Oi).

4 Possible applications

There has been extensive work on Bayesian model selection consistency
(Oi). All of these results can be extended to imply the Bayesian oracle
property (Oii). When there are already known results on the limiting behavior
of the oracle posterior $\pi(\cdot|M_0)$ under the true model, the limiting
behavior will also apply to $\pi$ from model averaging and to $\pi(\cdot|\hat{M})$ from model
selection. For example, asymptotic normality can be derived from the
Bernstein-von Mises theorems for various kinds of true models: for a
finite dimensional true model, see van der Vaart (1998), Section 10.2,
and Shen (2002) for nonparametric and semiparametric situations. For true
models being generalized linear models with increasing dimensions, one can
apply Ghosal (2000). For a quasi-Bayesian posterior, one can apply Belloni
and Chernozhukov (2009).

The following are some examples where posterior consistency for model 
selection (Oi) has been studied: 

(i) (Classical cases) It is known that posterior consistency for model selection
(Oi) commonly holds for the usual likelihood-based Bayes posterior
in classical finite dimensional cases (see, e.g., Wasserman 1997, eqn.
(42)). Then our result suggests that the oracle properties (Oii) (for model
averaging) and (Oiii) (for model selection) also hold. The implication
of the oracle property can be the classical square root $n$ asymptotic
normal limiting distributions as suggested by the Bernstein-von Mises
theorem (see, e.g., van der Vaart (1998), Section 10.2). In these
classical cases, model selection consistency (Oi) can be proved from the
BIC approximation (Schwarz 1978). The BIC approximation will be
extended in the Supplementary Materials to accommodate non-classical
cases and prove model selection consistency. See also Item (iv) below
regarding the cubic-root asymptotics.

(ii) (GMM) Model selection consistency (Oi) is also proved in Li and Jiang
(2014), who consider quasi-posteriors constructed from GMM (generalized
method of moments), allowing increasing dimensionality with
sample size. This is enough for them to derive the quasi-Bayesian
oracle properties, which imply asymptotic normality with efficient variances,
based on the Bernstein-von Mises type results of Belloni and
Chernozhukov (2009).

(iii) (GLM) For generalized linear models, in the high dimensional case (with
the dimension of parameters much higher than the sample size), Liang,
Song and Yu (2013) proved posterior model selection consistency
(in their equation (19)). Our results indicate that their results imply
the oracle properties of the posterior distribution of, e.g., the mean
parameters. Asymptotic normality can be obtained from the oracle
properties when the true model is sparse (bounded in $n$ or moderately
increasing), if one can apply an appropriate Bernstein-von Mises theorem
to the true model, such as those studied in Ghosal (2000) (for
generalized linear models with increasing dimensions).

(iv) (Cubic root asymptotics; we will discuss some of this in the Supplementary
Materials.) Jun, Pinkse and Wan (2012) have considered
discontinuous quasi-posteriors. With some choice of a scaling parameter
used in the quasi-posterior, it is possible to guarantee
model selection consistency (Oi). Then oracle properties (Oii) and (Oiii) will
hold for post-model-average or post-model-selection Bayesian inference.
An interesting feature here is that the limiting distribution of the oracle
posterior does not follow the classical $\sqrt{n}$-asymptotics. For some
choices of the scaling parameter used in the quasi-posterior, heuristics
following the approach of Jun, Pinkse and Wan (2012) suggest a convergence
rate slower than $n^{1/3}$.

(v) (Partial identification; we will discuss some of this in the Supplementary
Materials.) Liao and Jiang (2010) considered a quasi-posterior
derived from moment inequalities, where parameters are not point
identified but only identified up to a set $\Omega$ called an identification region.
In a model selection setup with a unique true model that intersects with
$\Omega$, [7] their Theorem 4.1 implies model selection consistency (Oi). Then
oracle results (Oii) and (Oiii) automatically hold. In this setup of partial
identification, the limiting distribution of the oracle posterior is
nonstandard: it does not shrink as $n$ increases, so the rate is $O(1)$.
Nor is the limiting distribution normal in general: the nonidentifiability
over $\Omega$ suggests that the limiting posterior distribution should be
roughly the prior distribution truncated inside the identification region
$\Omega$. (See comments in Section 3.1, Liao and Jiang 2010.)

7 This may be extended to the case with multiple true models, when we expect the 
limiting distribution to be a mixture distribution involving these multiple true models. 



(vi) (Quasi-posterior; we will discuss this in more detail in the Supplementary
Materials.) This is a general framework of quasi-posteriors
in which we can derive general bounds on the mis-selection probability
$\pi(M_0^c) = 1 - \pi(M_0)$. Here,

$$\pi \propto e^{-\lambda R_n} \kappa,$$

where $R_n$ is an empirical risk related to a theoretical risk $R$, $\kappa$ is a
prior, and $\lambda > 0$ is a scaling parameter which can increase with $n$.
Equations (2) and (3) in the Supplementary Materials (SM) directly
imply the following assumption-free bound for $\pi(M_0^c)$, which is used in
the Supplementary Materials to obtain model selection consistency:

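Proposition 3. (Model selection with quasi-posterior.)

$$\ln \pi(M_0^c) \le -0.5\lambda(\gamma - r - 2|u|),$$

where

$$\gamma = \inf_{M_0^c} R - \inf R,$$

$$r = -\lambda^{-1} \ln \int e^{-\lambda(R - \inf R)} \, d\kappa,$$

and

$$u = -(2\lambda)^{-1} \ln \int e^{-2\lambda[(R_n - R) - \int (R_n - R) \, d\phi]} \, d\phi,$$

in which

$$\phi \propto e^{-\lambda R} \kappa$$

is the “limiting version” of the quasi-posterior $\pi$, where the theoretical risk $R$
is used in place of the empirical risk $R_n$.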


This assumption-free bound is only useful when $\gamma > r + 2|u| > 0$. We
show in the SM that in fact typically we can make $r + 2|u| = o_p(\gamma)$,
so that $\pi(M_0^c)$ can be exponentially small in $\lambda\gamma$ and decrease
very quickly with the sample size $n$. This can happen with two methods:
one includes a complexity penalty in the risk $R$, and the other adds
no complexity penalty but considers candidate models with separated
parameter spaces.
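The penalized-risk route can be illustrated with a toy quasi-posterior. Everything specific in the sketch below is an assumption made for illustration and is not taken from the paper: Gaussian data with true mean 0, a point model $M_0$ ($\theta = 0$, $d_0 = 0$) nested in a one-parameter model $M_1$ ($\theta$ on a grid, $d_1 = 1$), a squared-error empirical risk, a prior $\kappa$ that is uniform within and across models, $\lambda = n$, and a penalty gap $\gamma = n^{-1/4}$. The function name `mis_selection_probability` is hypothetical.

```python
import numpy as np

def mis_selection_probability(y, lam, gamma, grid):
    """Quasi-posterior mass pi(M_0^c) of the wrong model in a toy setup.

    M_0 fixes theta = 0 (d_0 = 0); M_1 lets theta range over `grid` (d_1 = 1).
    Prior kappa: mass 1/2 on each model, uniform over the grid inside M_1.
    """
    def emp_risk(theta):
        # C_n(theta): squared-error empirical risk at each value in theta
        return 0.5 * np.mean((y[:, None] - theta[None, :]) ** 2, axis=0)

    # log of exp(-lam * R_n) * kappa, with R_n(j, theta) = C_n(theta) + d_j * gamma
    log_w0 = -lam * emp_risk(np.array([0.0])) + np.log(0.5)
    log_w1 = -lam * (emp_risk(grid) + gamma) + np.log(0.5 / grid.size)

    # subtract the common maximum before exponentiating, for stability
    shift = np.concatenate([log_w0, log_w1]).max()
    w0 = np.exp(log_w0 - shift).sum()
    w1 = np.exp(log_w1 - shift).sum()
    return w1 / (w0 + w1)

rng = np.random.default_rng(1)
grid = np.linspace(-2.0, 2.0, 401)
results = {}
for n in (25, 100, 400):
    y = rng.normal(loc=0.0, scale=1.0, size=n)   # data generated under M_0
    results[n] = mis_selection_probability(y, lam=n, gamma=n ** -0.25, grid=grid)
print(results)
```

With these choices $\lambda\gamma = n^{3/4}$ grows with $n$, so the printed mis-selection probabilities should shrink rapidly, in line with the exponential dependence on $\lambda\gamma$ described above.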
5 Discussion


We have established a fundamental relation between three different topics:
Bayesian model selection, model averaging, and oracle performance. The
relatively basic property of model selection consistency is shown to be equivalent
to a seemingly more advanced distributional result, the oracle property.
The result is very simple and general. Unlike some previous Bayesian oracle
properties discussed in special cases such as Ishwaran and Rao (2011) and Li
and Jiang (2014), the current work is completely free from any restriction on
the type of prior or (quasi-)likelihood function used, or even from any restriction
on the limiting distribution of the oracle posterior. It may be the first
time that the Bayesian oracle property has been studied at this general level.
Given the success of the frequentist analogue studied by Fan and Li (2001),
we believe our results will have applications in a wide variety of situations
(in addition to the possible examples discussed in this paper). For example,
although the applications in our SM focus on finite dimensional cases
only, the relationships described in the main text obviously allow increasing
dimensions as well.

6 SUPPLEMENTARY MATERIALS for “A Note on Bayesian Oracle Properties”

Wenxin Jiang [8] and Cheng Li [9]

July 22, 2015

6.1 Mean oracle property and convergence rate of model 
selection 

In this section we define a different kind of oracle property (the “mean
oracle property”), which is typically more demanding than model selection
consistency, but can be meaningful in more circumstances than the oracle
property (the “posterior oracle property”) in the main text.

We will now define the mean oracle property and show that it requires
more than model selection consistency: the error rate $\pi(M_0^c)$ needs to be
sufficiently small. In some situations the (quasi-)posterior $\pi$ is not useful but
its mean $E(\theta) = \int \theta \, d\pi$ is useful, which may have a good limiting distribution
for statistical inference for a parameter of interest $\theta_0$, even when the
(quasi-)posterior distribution itself does not have a valid interpretation. This can
happen for quasi-posteriors whose credible regions do not have asymptotically
correct coverage probability (see, e.g., Chernozhukov and Hong 2003).
We would like a mean oracle property such that

$$|E(\theta) - E(\theta|M_0)| = o_p(1) \, |E(\theta|M_0) - \theta_0|.$$

In this case, it is not useful enough to establish $|E(\theta) - E(\theta|M_0)| = o_p(1)$,
since, e.g., the convergence rate of $|E(\theta|M_0) - \theta_0|$ is typically $O_p(n^{-1/2})$ for
regular cases. Instead, we hope the difference $|E(\theta) - E(\theta|M_0)|$ between the
model average posterior mean and the oracle posterior mean is smaller to a
higher order. Note that

$$|E(\theta) - E(\theta|M_0)| = \pi(M_0^c) \, |E(\theta|M_0^c) - E(\theta|M_0)|.$$

Therefore, for a bounded parameter $\theta$, if

$$\pi(M_0^c) = o_p(n^{-1/2}), \quad \text{(fast convergence)}$$

then we have the mean oracle property. We will construct situations in which
this kind of “fast convergence” for model selection will happen. We will later
comment that in many other cases, the rate $o_p(n^{-1/2})$ cannot be guaranteed.
(The BIC approximation of the posterior implies that $\pi(M_0^c)$ can be of order
$O_p(n^{-1/2})$ due to candidate models that have only 1 redundant parameter.)

8 Shandong University and Northwestern University, wjiang@northwestern.edu

9 Duke University, cl332@stat.duke.edu

A different relation can be useful in this case:

$$E(\theta) - E(\theta|M_0) = \sum_{j \ne 0} \pi(M_j)[E(\theta|M_j) - E(\theta|M_0)]. \quad (1)$$

The mean oracle property holds if there is a finite number of model candidates
and each $\pi(M_j)[E(\theta|M_j) - E(\theta|M_0)] = o_p(1)|E(\theta|M_0) - \theta_0|$ for $j \ne 0$.
We have to check each model $M_j \in M_0^c$ (i.e., $M_j \ne M_0$). Each product
in the sum of (1) can be small enough for 2 different reasons. For models
which miss nonzero parameters, $\pi(M_j)$ is typically exponentially small.
For models that do not miss nonzero parameters but include redundant
parameters, $E(\theta|M_j) - \theta_0$ is typically of the same order as $E(\theta|M_0) - \theta_0$, and
therefore $E(\theta|M_j) - E(\theta|M_0)$ is also of the same order as $E(\theta|M_0) - \theta_0$. Then
it is enough to have $\pi(M_j) = o_p(1)$.
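The decomposition of the posterior mean over models follows from the law of total expectation. A short sketch, assuming a countable set of candidate models $M_j$, $j \ge 0$, with posterior model probabilities summing to one:

```latex
\begin{align*}
E(\theta) &= \sum_{j \ge 0} \pi(M_j)\, E(\theta|M_j),
 \qquad \sum_{j \ge 0} \pi(M_j) = 1, \\
E(\theta) - E(\theta|M_0)
 &= \sum_{j \ge 0} \pi(M_j)\,\bigl[E(\theta|M_j) - E(\theta|M_0)\bigr]
  = \sum_{j \ne 0} \pi(M_j)\,\bigl[E(\theta|M_j) - E(\theta|M_0)\bigr].
\end{align*}
```

The $j = 0$ term vanishes, which leaves only the wrong or overly complex models in the sum. Grouping all $j \ne 0$ into the single event $M_0^c$ gives the same difference written as $\pi(M_0^c)[E(\theta|M_0^c) - E(\theta|M_0)]$.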

Since we have established that both oracle properties (of the mean and of
the distribution) depend on the posterior probability of selecting wrong models,
$\pi(M_0^c)$, we devote the following sections to bounding this probability.

6.2 Risk convergence and model selection convergence 

In this section, we show a general way to bound the posterior probability 
of selecting wrong models, by how often the posterior proposes suboptimal 
values of a risk function. 

We have argued that the oracle properties are determined by the model
selection error $\pi(M_0^c)$. We now provide a bound for it through a relation to risk
convergence. We will define models $M_j$ as restricting $\theta \in \Theta_j$ for possibly
overlapping parameter spaces $\Theta_j$ (possibly dependent on sample size $n$),
$j = 0, 1, 2, \ldots$, and we use $M_0$ to denote the true model. We will consider a
risk function $R(j,\theta): \cup_{j \ge 0}(\{j\} \times \Theta_j) \mapsto \Re$, which can “risk identify” the
true model, roughly in the sense that $\inf R$ over $M_0$ is lower than $\inf R$ over
$M_0^c$. Then, if we can prove posterior risk convergence, in the sense that the
posterior will favor a small risk $R$, then we can hope to show that the posterior
will favor the true model $M_0$.

To be more rigorous, let $S = \cup_{j \ge 0} S_j$ and $S_j = \{j\} \times \Theta_j$. We will
define the true model $M_0$: $(j,\theta) \in S_0$. Then $M_0^c$ corresponds to the event
$(j,\theta) \in \cup_{j>0} S_j$. Define the gap $\gamma = \inf_{\cup_{j>0} S_j} R - \inf_{\cup_{j \ge 0} S_j} R$.

Then we obviously have

$$\pi(M_0^c) \le \pi(R \ge \inf_S R + \gamma). \quad (2)$$

This connects model selection convergence to risk convergence at the rate of the
gap. A zero gap will make the above relation useless, however.
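The step from the definition of the gap to inequality (2) is a set inclusion followed by monotonicity of the probability measure $\pi$. A brief sketch in the notation of this section:

```latex
\begin{align*}
(j,\theta) \in \cup_{j>0} S_j
 &\;\Rightarrow\; R(j,\theta) \ge \inf_{\cup_{j>0} S_j} R = \inf_S R + \gamma, \\
M_0^c \subseteq \{R \ge \inf_S R + \gamma\}
 &\;\Rightarrow\; \pi(M_0^c) \le \pi(R \ge \inf_S R + \gamma).
\end{align*}
```

When $\gamma = 0$ the event on the right is the whole space, which is why a zero gap makes the relation uninformative.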

We will consider several risks for $R$ to make a nonzero gap.

1. (separated parameter spaces) The risk $R(j,\theta) = C(\theta)$ does not depend
on $j$. In this case, the gap would be zero and make the above relation
useless if the true model is nested in some other models: $\Theta_0 \subset \Theta_j$ for
some $j > 0$. Instead, we will consider separated parameter spaces. [10] Impose
a separation condition on the parameter spaces (in minimal set distance)
$d(\Theta_0, \cup_{j>0}\Theta_j) \ge \delta \, (> 0)$, under a suitable metric $d$. If we have
an identifiability condition on the “true parameter” $\theta_0 = \arg\min_{\Theta_0} C(\theta)$:
$d(\theta, \theta_0) \ge \delta \Rightarrow C(\theta) - C(\theta_0) \ge \gamma \, (> 0)$, then the gap can be taken to be at
least $\gamma > 0$. (We allow partial identification, in that the minimizer $\theta_0$ can be a
nonsingleton set.)

2. (penalized risk) The risk depends on $j$, and includes a penalty
against the complexity of $\Theta_j$. Then $\Theta_0 \subset \Theta_j$ for a more complicated $\Theta_j$, $j > 0$,
does not automatically lead to a useless zero gap. For example, suppose
$R(j,\theta) = C(\theta) + d_j\gamma$, where $d_j$ is a complexity penalty such that $d_j \ge d_0 + 1$
for any model $M_j$ with $\Theta_j \supset \Theta_0$. Then although $\inf_{\Theta_j} C(\theta) - \inf_{\Theta_0} C(\theta) = 0$,
$\inf_{\theta \in \Theta_j} R(j,\theta) - \inf_{\theta \in \Theta_0} R(0,\theta) \ge \gamma$. Suppose that for all other $j > 0$ (such
that $\Theta_j \not\supset \Theta_0$), we have $\inf_{\Theta_j} C(\theta) - \inf_{\Theta_0} C(\theta) \succ \gamma$ (in the sample-size-dependent
sense, where $\gamma > 0$ decreases with sample size $n$). Then the gap can be taken
as $\gamma > 0$.

10 This is related (and may be extended) to the use of nonlocal priors on the $\theta$ components
(see, e.g., Johnson and Rossell 2012), which can be used to reduce prior masses near
the nested true model.


3. The risk function $R$ in Case 1 above can be the Hellinger distance
$d_H(f_\theta, f_{\theta_0})$ between the parameterized densities, or the regression mean square
error $\|\mu_\theta - \mu_{\theta_0}\|^2$ between the parameterized mean functions, or the classification
error of a parameterized classification rule. Then the current relation
(2) connects model selection consistency to any posterior risk consistency
results in a model average context, in Hellinger distance, in regression or
classification, such as in Jiang (2007). Then we can prove oracle properties
due to the fundamental equivalence that we have established in our paper,
between model selection and posterior oracle properties. This scheme can be
symbolically represented as:

risk convergence $\Rightarrow$ model selection consistency $\Leftrightarrow$ oracle property.

6.3 Quasi-posterior and risk convergence 

In this section, we show a general way to bound the posterior probability 
of proposing suboptimal values of a risk function, by using its empirical risk 
to construct the (quasi-)posterior. 

We now consider a class of quasi-posteriors, where

$$\pi \propto e^{-\lambda R_n} \kappa,$$

where $\kappa$ is a prior, $R_n$ is an empirical risk, and $\lambda > 0$ is a scaling parameter
that can depend on $n$, which is analogous to the inverse temperature in
statistical physics. Typically $\lambda = n$, as in the usual Bayesian posterior, where
$-\lambda R_n$ is the log likelihood. However, we will allow other rates for $\lambda \succ 1$ in
our general formulation.

In this case (as we will prove in Section 6.7), the risk convergence rate
in $R$ can be bounded as

$$\pi(R \ge \inf R + \gamma) \le e^{-0.5\lambda(\gamma - r - 2|u|)}, \quad (3)$$

where

$$r = -\lambda^{-1} \ln \int e^{-\lambda(R - \inf R)} \, d\kappa,$$

and

$$u = -(2\lambda)^{-1} \ln \int e^{-2\lambda[(R_n - R) - \int (R_n - R) \, d\phi]} \, d\phi,$$

in which

$$\phi \propto e^{-\lambda R} \kappa$$

is the “limiting version” of the quasi-posterior $\pi$, where the theoretical risk $R$ is
used in place of the empirical risk $R_n$.

The term $u$ measures the difference $R_n - R$ on the support of the limiting
posterior. We here use the simplest uniform bound

$$|u| \le 2 \sup |R_n - R|,$$

where the supremum is taken over the entire support of the prior. This will
typically lead to $u = O_p(d \ln n/\sqrt{n})$ (due to a uniform large deviation theorem),
where $d$ is a complexity measure related to the maximal parameter
dimension.


The term $r$ can similarly be bounded by $r = O(d \ln\lambda/\lambda)$ (due to a Laplace
approximation of $R$). [11] This latter rate can also be derived from the inequality

$$-\lambda^{-1} \ln \int e^{-\lambda(R - \inf R)} \, d\kappa \le \inf_{a>0} \left[ a - \lambda^{-1} \ln \kappa(R - \inf R \le a) \right]$$

and choosing $a = d \ln\lambda/\lambda$. The detailed argument is similar to the remarks after
Proposition 1 in Li, Jiang and Tanner (2014).


Therefore, if the gap $\gamma \succ r + 2|u|$ and $\gamma \succ 2.1 \ln n/\lambda$ ($a_n \prec b_n$ or $b_n \succ a_n$
means $a_n/b_n \to 0$ as $n \to \infty$), then

$$\pi(M_0^c) \le \pi(R \ge \inf R + \gamma) \prec e^{-\ln n} = 1/n,$$

and we achieve the fast convergence of model selection, and therefore the oracle
properties for both the posterior distribution and the posterior mean hold.
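The arithmetic behind the $1/n$ rate can be unpacked as follows. This is only a sketch, reading $\prec$ in probability and writing $o_p(1)$ for the vanishing ratio $(r + 2|u|)/\gamma$:

```latex
\begin{align*}
\gamma \succ r + 2|u| &\;\Rightarrow\; \gamma - r - 2|u| = \gamma\,(1 - o_p(1)), \\
\gamma \succ 2.1 \ln n/\lambda &\;\Rightarrow\; \lambda\gamma/\ln n \to \infty, \\
\frac{e^{-0.5\lambda(\gamma - r - 2|u|)}}{1/n}
 &= \exp\Bigl\{-\ln n\,\Bigl[0.5\,\frac{\lambda\gamma}{\ln n}\,(1 - o_p(1)) - 1\Bigr]\Bigr\} \to 0 .
\end{align*}
```

The bracket in the last line diverges, so the bound in (3) is of smaller order than $1/n$, which is more than the $o_p(n^{-1/2})$ needed for the mean oracle property.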


11 Tighter bounds on $|u|$ and $r$ are possible where the maximal complexity $d$ is replaced
by the complexity of the true model $d_0$. This can be achieved by expressing the integrations
over $d\kappa$ and $d\phi$ using model averaging. We do not do this here, since we are treating finite
dimensional models here with bounded $d$.

It is noted that when $R$ depends on both the model index $j$ and $\theta$, such as
the penalty-based $R = C(\theta) + d_j\gamma$ discussed in the last subsection, in order for
$|u| \le 2 \sup |R_n - R|$ to be smaller than the gap $\gamma$, we can let $R_n = C_n(\theta) + d_j\gamma$,
where $C_n$ is a usual sample version of $C$ with $\sup |C_n - C|$ typically of order
$d \ln n/\sqrt{n}$ as mentioned before. This leads to a quasi-posterior

$$\pi \propto e^{-\lambda C_n} \kappa \, e^{-\lambda\gamma d_j}.$$

This corresponds to a strong complexity penalty on the model complexity
$d_j$. Such a dimensional penalty is not needed when we use separated parameter
spaces (corresponding to a local prior), as discussed in the previous
subsection.

Note that both these methods (penalty prior or separated parameter
spaces) are very general, and neither requires differentiability of $R_n$ or point
identification of $\theta_0 = \arg\min_{\Theta_0} C$.

A third method will allow more general priors but a less general $R_n$ which
allows a BIC (or Laplace) type approximation.

6.4 BIC approximation and model selection convergence

In this section we express the BIC approximation with a general scaling 
parameter, which can be used later to treat nonstandard asymptotics. 

In the BIC approach, the complexity penalty will only come indirectly
from approximating an integral. We do not need to directly penalize the risk
function. Therefore the risk function $R(j,\theta) = C(\theta)$ does not depend on the
model index $j$ and the empirical risk $R_n$ is $C_n(\theta)$.

Assume that the integral in the posterior model probability

$$\pi(M_j) \propto \int_{\{j\} \times \Theta_j} e^{-\lambda C_n} \, d\kappa \quad (4)$$

satisfies a BIC type approximation

$$-\lambda^{-1} \ln \int_{\{j\} \times \Theta_j} e^{-\lambda C_n} \, d\kappa = C_n(\theta_j) + \frac{\ln\lambda}{2\lambda} \cdot \dim(\Theta_j) + O_p(\lambda^{-1}),$$

where $\theta_j = \arg\min_{\theta \in \Theta_j} C(\theta)$ is assumed to be unique for convenience, and $C$
is the pointwise limit of $C_n$ in probability.

Then denoting $d_j = \dim(\Theta_j)$, we have

$$\pi(M_j)/\pi(M_0) = O_p(\exp\{-\lambda \max\{C_n(\theta_j) - C_n(\theta_0), \; 0.5(d_j - d_0) \ln\lambda/\lambda\}\}).$$

Any ‘wrong’ model where $\inf_{\{j\} \times \Theta_j} C - \inf C = c_j > 0$ (and $C_n(\theta_j) - C_n(\theta_0) =
c_j + o_p(1)$) will have an exponential rate $\pi(M_j)/\pi(M_0) = O_p(e^{-0.9 c_j \lambda})$. Any
‘true’ model where $\inf_{\{j\} \times \Theta_j} C - \inf C = 0$ will satisfy $\theta_j = \theta_0$, $C_n(\theta_j) -
C_n(\theta_0) = 0$, and the second term in the maximum dominates. [12] Therefore,
any overly complex ‘true’ model with $d_j > d_0$ will have a polynomial rate

$$\pi(M_j)/\pi(M_0) = O_p((\ln\lambda/\lambda)^{(d_j - d_0)/2}).$$

Suppose that there is no simpler ‘true’ model that has $d_j < d_0$ other than
$M_0$, and that there is a fixed number of candidate models. Then $\pi(M_0^c) =
o_p(1)$, and the posterior oracle property holds. In addition, the posterior
mean can also satisfy the oracle property due to the comments following (1).
This is because the probability of selecting overly complex true models is
$o_p(1)$, and the probability of selecting ‘wrong’ models is exponentially small
in $\lambda$. Assume that the scaling parameter $\lambda$ is polynomial in $n$, and that all
the ‘true’ models with $\inf_{\{j\} \times \Theta_j} C - \inf C = 0$ have a common polynomial
convergence rate $\epsilon_n$ for $E(\theta|M_j) - \theta_0$. Then $|E(\theta) - E(\theta|M_0)| = o_p(\epsilon_n)$ due to (1).

6.5 Cubic root asymptotics 

In this section, we argue that the BIC approximation can be applied to
a nonstandard case considered by Jun, Pinkse and Wan (2012). Our mean
oracle property implies that one of the nonstandard convergence rates from
Jun, Pinkse and Wan (2012), who did not consider model selection but assumed
the true model to be known, remains true even after model selection,
i.e., even if the true model is unknown.

12 This is not generally true for “nonnested” true models as described in Hong and
Preston (2012), Section 4.2.2.

The BIC condition for (4) is usually derived from a quadratic approximation
of the empirical risk $C_n$. We note that the BIC condition may still hold even
when $C_n$ is not differentiable. For example, for predictors $X_i$ and binary
responses $Z_i$, when $C_n$ has a discontinuous form $C_n = n^{-1} \sum_{i=1}^{n} Z_i I[X_i \theta > 0]$,
its expectation $C = E C_n$ may still have a quadratic approximation.
Following the cubic-root asymptotics as described in Jun, Pinske and Wan
(2012) (their Theorem 1), we find that the BIC condition can still hold
for an unusual scale parameter $\lambda \prec n^{2/3}$ (see footnote 13). The heuristics of the condition
$\lambda \prec n^{2/3}$ are as derived in Jun, Pinske and Wan (2012). For model $M_0$ where
$\theta_0 = \arg\min_{\Theta_0} C(\theta)$ is assumed to be unique (and similarly for all other
models): $\lambda (C_n(\theta) - C_n(\theta_0)) \approx \lambda [C(\theta) - C(\theta_0)] + \lambda [(C_n - C)(\theta) - (C_n -
C)(\theta_0)]$. The second square bracket is stochastic and $O_p(n^{-1/2} \|\theta - \theta_0\|^{\beta})$
where $\beta = 0.5$, instead of 1, due to the indicator functions in $C_n$. The first
square bracket is nonstochastic and can be approximated by a quadratic form
$0.5 (\theta - \theta_0)^T V (\theta - \theta_0)$ for some positive definite matrix $V$. If the stochastic
term is dominated by the quadratic nonstochastic term, then we can ignore
the stochastic term and have a BIC type approximation. This will happen
when $\lambda n^{-1/2} \|\theta - \theta_0\|^{\beta} \prec \lambda \|\theta - \theta_0\|^2 = t^2 \sim 1$ where $t$ is called a rescaled
parameter. This leads to $\lambda n^{-1/2} \lambda^{-\beta/2} \prec 1$ or $\lambda \prec n^{1/(2-\beta)} = n^{2/3}$.
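
The exponent arithmetic in the heuristic above can be checked with exact fractions. The sketch below assumes $\lambda = n^a$ and simply tracks powers of $n$; the function names are hypothetical and the code verifies the algebra only, not the underlying asymptotic theory.

```python
from fractions import Fraction

def lambda_exponent_threshold(beta):
    """Exponent 1/(2 - beta) such that lam ≺ n**(1/(2 - beta)) is required."""
    return 1 / (2 - Fraction(beta))

def stochastic_term_exponent(a, beta):
    """Power of n in lam * n**(-1/2) * lam**(-beta/2) when lam = n**a."""
    a, beta = Fraction(a), Fraction(beta)
    return a * (1 - beta / 2) - Fraction(1, 2)

beta = Fraction(1, 2)  # the value stated in the text for indicator-type risks
threshold = lambda_exponent_threshold(beta)
print("threshold exponent:", threshold)
# At the threshold the stochastic term is of order n**0; below it, the term vanishes.
print("exponent at threshold:", stochastic_term_exponent(threshold, beta))
print("exponent at a = 1/2:", stochastic_term_exponent(Fraction(1, 2), beta))
```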

As commented in the last section, the fact that the BIC condition can still
hold for $\lambda \prec n^{2/3}$ implies both the posterior oracle property and the mean
oracle property. The posterior convergence rate is, due to the quadratic
approximation above, $\lambda^{-1/2}$, which is slower than $n^{-1/3}$ due to the condition
on $\lambda$. In this case, the posterior oracle property is often not useful because the
posterior is a quasi-posterior that does not guarantee a valid interpretation.
However, the posterior mean is asymptotically normal with no bias if $\lambda \succ n^{2/5}$
(see the comments after Theorem 1 of Jun, Pinske and Wan 2012). Therefore,
the mean oracle property is still useful. Applying Case (iii) of Theorem 1
in Jun, Pinske and Wan (2012), we have that $\int \theta\, d\pi$ is asymptotically normal
for the true parameter, at a convergence rate $n^{-1/2} \lambda^{1/4}$, which is faster than
$n^{-1/3}$ and slower than $n^{-2/5}$ for our choice of $\lambda$, namely $n^{2/5} \prec \lambda \prec n^{2/3}$.
Our mean oracle property says that this rate from Jun, Pinske and Wan
(2012), who did not consider model selection but assumed the true model
to be known, remains true even after model selection, i.e., even if the true
model is unknown.
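
The rate comparisons in this paragraph reduce to comparing powers of $n$ once we write $\lambda = n^a$. The following sketch, with hypothetical function names and an arbitrarily chosen interior value $a = 1/2$, checks those comparisons with exact fractions.

```python
from fractions import Fraction

def posterior_rate_exponent(a):
    """Power of n in the posterior rate lam**(-1/2) when lam = n**a."""
    return -Fraction(a) / 2

def mean_rate_exponent(a):
    """Power of n in the posterior-mean rate n**(-1/2) * lam**(1/4) when lam = n**a."""
    return Fraction(-1, 2) + Fraction(a) / 4

a = Fraction(1, 2)  # hypothetical choice with 2/5 < a < 2/3
print("posterior rate exponent:", posterior_rate_exponent(a))   # closer to 0 than -1/3
print("posterior mean rate exponent:", mean_rate_exponent(a))   # between -2/5 and -1/3
# The two endpoints of the admissible range reproduce the boundary rates.
print("a = 2/3:", mean_rate_exponent(Fraction(2, 3)))
print("a = 2/5:", mean_rate_exponent(Fraction(2, 5)))
```

A less negative exponent means a slower rate, so the posterior rate is slower than $n^{-1/3}$ while the posterior mean rate lies strictly between $n^{-2/5}$ and $n^{-1/3}$.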

13 We note that $\pi(M_j)$ is proportional to the denominator of equation (44) of
Jun, Pinske and Wan (2012), which is shown to converge to its $d_j$-dimensional quadratic
approximation in probability, which implies the BIC condition.

6.6 Partial identification


In this section, we study the partial identification caused by moment
inequalities. We first study the oracle posterior, where the model is given, and
show that the limiting posterior is nonstandard due to partial identification.
Secondly, we introduce an example to show that defining a meaningful true
model can be very subtle due to partial identification; simpler models which
fit the data as well are not always better than more complex models. We
finally show that to be conservative, one can define a mixture true model by
allowing all models that are compatible with the data, and that this mixture
true model is meaningful and can be consistently selected.

6.6.1 Nonstandard limiting distribution due to partial identification.

First we assume that the model is given and explain how partial
identification leads to a nonstandard limiting posterior. Suppose we are interested
in making inference about some parameter $\theta$ that satisfies a vector of moment
inequalities $E m(\theta) \geq 0$, where $m(\theta)$ is a vector of random moment functions
that depend on the data. We can define $\pi \propto e^{-\lambda R_n} \kappa$ where $\lambda$ is the scaling
parameter that could depend on $n$, $\kappa$ is a prior for $\theta$,

$$R_n(\theta) = d^2(\bar{m}(\theta), \mathbb{R}_+) = \inf_{\psi \geq 0} d^2(\bar{m}, \psi)$$

for some (minimal set) distance $d$, $\mathbb{R}_+$ represents the set of $\psi$-vectors with
all nonnegative components, and $\bar{m}$ denotes the sample average of $m(\theta)$.
When $d$ is the (possibly weighted) Euclidean distance and $\lambda = 0.5n$, this becomes the
choice of empirical risk in Chernozhukov, Hong and Tamer (2007), and also
corresponds to a Laplace approximation of the posterior used in Liao and
Jiang (2010). When the identification region $\Omega = [\theta : d^2(E m(\theta), \mathbb{R}_+) = 0]$
has positive prior probability, typically one has, in total variation distance,

$$d_{TV}(\pi, \kappa(\cdot|\Omega)) = o_p(1), \tag{5}$$

where the limiting distribution is $\kappa(\theta|\Omega) \propto \kappa(\theta) I(\theta \in \Omega)$, i.e., the prior
truncated in the identification region $\Omega$ (see footnote 14). This limiting distribution has
contraction rate 1, instead of the usual rate $n^{-1/2}$, due to partial identification, and
is generally nonnormal.
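
To make the empirical risk concrete, the sketch below assumes the unweighted Euclidean distance (the text also allows weighted versions), for which the infimum over $\psi \geq 0$ has the closed form $\sum_k \min(\bar{m}_k, 0)^2$. The function names, the moment vectors, and the value of $\lambda$ are hypothetical.

```python
import math

def squared_distance_to_orthant(m_bar):
    """Squared Euclidean distance from m_bar to the nonnegative orthant.

    The closest psi >= 0 keeps nonnegative components and clips negative
    ones to zero, so only violated moments contribute.
    """
    return sum(min(m_k, 0.0) ** 2 for m_k in m_bar)

def quasi_posterior_weight(m_bar, lam):
    """Unnormalized factor exp(-lam * R_n) multiplying the prior."""
    return math.exp(-lam * squared_distance_to_orthant(m_bar))

satisfied = (0.2, 0.0, 1.5)   # all moment inequalities hold
violated = (0.2, -0.3, 1.5)   # the second inequality is violated by 0.3

for lam in (5.0, 50.0, 500.0):
    print(lam, quasi_posterior_weight(satisfied, lam), quasi_posterior_weight(violated, lam))
```

The weight stays at 1 wherever the inequalities hold and collapses toward 0 elsewhere as $\lambda$ grows, which is the indicator-like behaviour behind the truncated-prior limit.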

14 Heuristically, $e^{-\lambda R_n}$ converges to $I(\theta \in \Omega)$ almost surely for large $n$, and therefore
$\pi \propto \kappa e^{-\lambda R_n}$ converges to $\kappa(\theta) I(\theta \in \Omega)$.

6.6.2 Subtlety of consistent model selection due to partial identification.

In this subsection, we claim that due to the uncertainty about the true
parameter associated with partial identification, it is NOT always good to favor
simpler models that are compatible with the data. We need to be conservative
to allow possibly nonzero parameters in the model.

Now we proceed to model selection and describe the subtlety of defining
a meaningful true model due to partial identification. In a similar
situation of moment inequalities, Liao and Jiang (2010) have considered model
selection consistency using complexity penalization. However, as pointed
out in Liao (2010), their consistent procedure for joint model and moment
condition selection may miss the true parameter. Liao (2010) considered a
counter-example (Example 3.4.1) with selection of moment conditions.

A simpler counterexample with only model selection is presented here.
Suppose that, for $j = 1, 2$, $\theta_j = E X_j Y_j$, $Y_j > 0$, and $X_j$ is unobserved but only
known to be between $L_j$ and $U_j$ ($L_j \leq X_j \leq U_j$). Then we have the
following partially identified moment constraints: $d(E m(\theta), \Psi) = 0$, where
$m_j = E U_j Y_j - \theta_j$, $m_{j+2} = \theta_j - E L_j Y_j$ (for $j = 1, 2$), and $\Psi = (\mathbb{R}_+)^4$. The
identification region is $\Omega = [\theta : d(E m(\theta), \Psi) = 0] = [E L_1 Y_1, E U_1 Y_1] \times [E L_2 Y_2, E U_2 Y_2]$.
We consider the quasi-posterior

$$\pi \propto e^{-\lambda R_n} \kappa$$

where $\lambda = 0.5n$ and $R_n = \inf_{\psi \in \Psi} (\bar{m} - \psi)^T V^{-1} (\bar{m} - \psi) = d^2(\bar{m}, \Psi)$, with a
given positive definite symmetric matrix $V$, which corresponds to
$d(a, b) = \sqrt{(a - b)^T V^{-1} (a - b)}$.

A model which proposes $\theta \in \Theta_j$ is compatible if $\Theta_j \cap \Omega$ is nonempty. This
will make $\inf_{\theta \in \Theta_j} R = \inf R = 0$, where $R = d^2(E m(\theta), \Psi)$. Incompatible
models have $\inf_{\theta \in \Theta_j} R > 0$, which can be shown to have ignorable posterior
probabilities due to the earlier sections that discuss risk convergence.

Consider three candidate models $\theta \in \Theta_1 = \{0\} \times [-1, 1]$, $\Theta_2 = [-1, 1] \times
[-1, 1]$, $\Theta_3 = [-1, 1] \times \{0\}$. Suppose that the identification region is $\Omega =
[E L_1 Y_1, E U_1 Y_1] \times [E L_2 Y_2, E U_2 Y_2] = [0.3, 0.4] \times [-0.2, 0.6]$.
The second component $\theta_2$ therefore has an intrinsic uncertainty for model
selection. It is possible that the true parameter has $\theta_2^* = 0$, but it is also
possible that $\theta_2^* = 0.5$, since both $\theta_2 = 0$ and $0.5$
fall in $[-0.2, 0.6] = [E L_2 Y_2, E U_2 Y_2]$. The data cannot tell between the two
possibilities.

A complexity penalized model selection would select the simplest model
that is compatible with the moment constraints, where a compatible model
is such that $\inf_{\theta \in \Theta_j} R = \inf R = 0$, or equivalently, $\Theta_j \cap \Omega$ is nonempty.
In our case, both models $\Theta_2$ and $\Theta_3$ are compatible but not $\Theta_1$, and the simplest
compatible model is $\Theta_3$. Yet, this is a wrong model when $\theta_2^* = 0.5$, since
the nonzero $\theta_2^*$ component is missed by $\Theta_3$!
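
The compatibility check in this counterexample can be reproduced mechanically. The sketch below encodes each parameter set as a box of intervals (with $\{0\}$ as a degenerate interval) and uses the identification region stated above; the data structures and function names are hypothetical.

```python
# Each model is a box: one (low, high) interval per coordinate; {0} is (0.0, 0.0).
MODELS = {
    1: ((0.0, 0.0), (-1.0, 1.0)),
    2: ((-1.0, 1.0), (-1.0, 1.0)),
    3: ((-1.0, 1.0), (0.0, 0.0)),
}
OMEGA = ((0.3, 0.4), (-0.2, 0.6))  # identification region from the text

def is_compatible(model, region):
    """True if the model's box intersects the identification region."""
    return all(max(lo1, lo2) <= min(hi1, hi2)
               for (lo1, hi1), (lo2, hi2) in zip(model, region))

def dimension(model):
    """Number of free (non-degenerate) coordinates."""
    return sum(1 for lo, hi in model if hi > lo)

compatible = [j for j, box in MODELS.items() if is_compatible(box, OMEGA)]
simplest = min(compatible, key=lambda j: dimension(MODELS[j]))

# The simplest compatible model forces theta_2 = 0, so it cannot contain theta_2 = 0.5.
lo2, hi2 = MODELS[simplest][1]
allows_half = lo2 <= 0.5 <= hi2

print("compatible models:", compatible)
print("simplest compatible model:", simplest)
print("allows theta_2 = 0.5:", allows_half)
```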

6.6.3 A possible solution to the subtlety of consistent model selection

We note that in this partial identification case, we cannot simply choose
the simplest compatible model. A consistent model selection method that
chooses the simplest compatible model may be wrong. Liao (2010) proposes
to give up forcing the model selection consistency and use a meaningful prior
to allow all compatible models to be represented with nonvanishing posterior
probabilities. We will view this problem from a different perspective. We will
still view it as a problem of consistent model selection, but to be conservative,
we redefine the true model to be a mixture distribution over all models that
are compatible with the data. Then the model selection procedure based on
a quasi-posterior will still consistently select this mixture true model.
We will also derive the limiting posterior given this mixture true model, and
argue that it makes sense intuitively due to partial identification.

For model $j$ and $\theta \in \Theta_j$, let $\pi(j, \theta) \propto e^{-\lambda R_n(\theta)} \nu(j, \theta)$ where $\nu$ denotes
the prior distribution and $R_n = d^2(\bar{m}(\theta), \Psi)$ is the empirical risk. Let
$R(\theta) = d^2(E m(\theta), \Psi)$ be the theoretical risk. Let the identification region be
$\Omega = [\theta : R(\theta) = 0]$ and assume that it is nonempty.

A compatible model $j$ is such that $\Theta_j \cap \Omega \neq \emptyset$ and $\inf_{\theta \in \Theta_j} R(\theta) = 0$.
We will say that $j \in J_0$. A model is incompatible if $\Theta_j \cap \Omega = \emptyset$ or
$\inf_{\theta \in \Theta_j} R(\theta) > 0$. We will say that $j \in J_1$. Let $g = \inf_{j \in J_1} \inf_{\theta \in \Theta_j} R(\theta)$
and assume it to be positive (which will be true if there is a finite number of
candidate models).

To be conservative, we would like to allow all compatible models to be
kept in the large sample limit. We can therefore attempt to group $j \in J_0$
together to be the true model $M_0$ and relabel it as $k = 0$. Then one can rewrite
$\pi(k, \theta) \propto e^{-\lambda R_n(\theta)} \kappa_k \kappa(\theta|k)$ where $\kappa_0 = \nu(j \in J_0)$, $\kappa(\theta|0) = \nu(\theta | j \in J_0) =
\sum_{j \in J_0} \nu(j, \theta) / \sum_{j \in J_0} \nu(j)$ can be recognized as a mixture prior over all
compatible models, and $\{\kappa_k \kappa(\theta|k) : k > 0\}$ take the same values as $\{\nu(j, \theta) : j \in J_1\}$.

Then the current formalization is converted to the same form as the setting
in which (2) and (3) are stated. The gap parameter $\gamma$ in (2) and (3) can be taken as $g$ here to obtain:

$$\pi(M_0^c) \leq e^{-0.5\lambda(g - r - 2|u|)}.$$

This is typically exponentially small in $\lambda$ and $n$, which will imply oracle
properties for both the posterior distribution and the posterior mean.

Due to the discussions for (5), we know that the limiting oracle prior
(which is also the limit of the posterior) is

$$\kappa(\theta | \Omega, k = 0) \propto \kappa(\theta|0) I(\theta \in \Omega) \propto \sum_{j \in J_0} \nu(j, \theta)\, I(\theta \in \Omega).$$

This is a mixture of priors over all compatible models, truncated in the
identification region $\Omega$. Even though this limit is very different from the
classical square root $n$ normal limit, it is meaningful and makes common
sense. The problem is partially identified: the data identify only the region
$\Omega$, and within the region, information only comes from the priors of all models
that are compatible with the data.
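
The following sketch illustrates how the incompatible model loses posterior mass in the three-model example of Section 6.6.2. It rests on several assumptions that are not in the text: the population risk $R$ stands in for $R_n$, $V$ is the identity, each model has prior weight 1/3 with a uniform prior on its parameter set, and integrals are approximated on a grid. All names are hypothetical.

```python
import math

# Population bounds E[L_j Y_j] and E[U_j Y_j] from the example's identification region.
LOWER = (0.3, -0.2)
UPPER = (0.4, 0.6)

def risk(theta):
    """R(theta): squared Euclidean distance of E m(theta) to the nonnegative orthant."""
    moments = (UPPER[0] - theta[0], UPPER[1] - theta[1],
               theta[0] - LOWER[0], theta[1] - LOWER[1])
    return sum(min(m, 0.0) ** 2 for m in moments)

GRID = [-1.0 + 2.0 * i / 200 for i in range(201)]  # grid on [-1, 1]

def model_mass(j, lam):
    """Grid average of exp(-lam * R) under a uniform prior on model j's set."""
    if j == 1:      # {0} x [-1, 1]
        points = [(0.0, t) for t in GRID]
    elif j == 3:    # [-1, 1] x {0}
        points = [(t, 0.0) for t in GRID]
    else:           # [-1, 1] x [-1, 1]
        points = [(s, t) for s in GRID for t in GRID]
    return sum(math.exp(-lam * risk(p)) for p in points) / len(points)

def model_probabilities(lam):
    """Quasi-posterior model probabilities under equal prior model weights."""
    masses = {j: model_mass(j, lam) for j in (1, 2, 3)}
    total = sum(masses.values())
    return {j: mass / total for j, mass in masses.items()}

for lam in (5.0, 50.0, 500.0):
    print(lam, model_probabilities(lam))
```

In this setup the gap is $g = 0.09$ (model 1 fixes $\theta_1 = 0$, which is 0.3 away from the identified interval), so model 1's mass is bounded by $e^{-0.09\lambda}$, while both compatible models keep nonvanishing probability, in line with the mixture true model described above.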

6.7 Proof of Propositions 

Proof of Proposition 1: 

For any events $A$ and $B$, $\pi(A) = \pi(A|B)\pi(B) + \pi(A|B^c)\pi(B^c)$ and $\pi(A|B) =
\pi(A|B)\pi(B) + \pi(A|B)\pi(B^c)$. Then

$$|\pi(A) - \pi(A|B)| = |\pi(A|B^c) - \pi(A|B)|\, \pi(B^c) \leq \pi(B^c)$$

for any $A$, and the upper bound is reachable at $A = B^c$. Setting $B = [M =
M_0]$ leads to the proof. Q.E.D. (See footnote 15.)
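
The identity $\sup_A |\pi(A) - \pi(A|B)| = \pi(B^c)$ can be checked by brute force on a small finite space. The sketch below uses a hypothetical four-point distribution and exact rational arithmetic, enumerating every event $A$; it is a sanity check of the argument, not part of the proof.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical discrete distribution; B plays the role of the event [M = M_0].
PROB = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}
B = {"a", "b"}

def prob(event):
    """Probability of a set of outcomes."""
    return sum((PROB[x] for x in event), Fraction(0))

def cond_prob(event, given):
    """Conditional probability of event given a positive-probability event."""
    return prob(set(event) & given) / prob(given)

def all_events(outcomes):
    """Yield every subset of the outcome space."""
    items = sorted(outcomes)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

sup_gap = max(abs(prob(A) - cond_prob(A, B)) for A in all_events(PROB))
complement_mass = prob(set(PROB) - B)

print("sup_A |pi(A) - pi(A|B)| =", sup_gap)
print("pi(B^c) =", complement_mass)
```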

Proof of Proposition 2: 

In the proof of Proposition 1 above, we could replace $M_0$ by $M$ and obtain
$\sup_A |\pi(A) - \pi(A|M)| \leq \pi(M^c)$. The right hand side is at most $\pi(M_0^c)$ since
$\pi(M) \geq \pi(M_0)$. Combining this with the result of Proposition 1 via the
triangle inequality leads to the proof. Q.E.D.

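Proof of Proposition 3 (see footnote 16):

We first show that

$$\ln \pi(h) \leq 0.5 \ln \phi(h^2) - \lambda u, \tag{6}$$

where $\pi(h) = \int h\, d\pi$ and $\phi(h^2) = \int h^2\, d\phi$ and so on. This is proved using the
definition:

$$\pi(h) = \frac{\int e^{-\lambda R_n} h\, d\kappa}{\int e^{-\lambda R_n}\, d\kappa} = \frac{\int e^{-\lambda (R_n - R)} h\, d\phi}{\int e^{-\lambda (R_n - R)}\, d\phi}.$$

Then use Jensen’s inequality for the denominator and the Cauchy-Schwarz
inequality for the numerator to get

$$\pi(h) \leq \frac{\sqrt{\int e^{-2\lambda (R_n - R)}\, d\phi \int h^2\, d\phi}}{e^{-\lambda \int (R_n - R)\, d\phi}} = \sqrt{\int e^{-2\lambda [(R_n - R) - \int (R_n - R)\, d\phi]}\, d\phi \int h^2\, d\phi},$$

which leads to (6). Then we take $h = I(A)$ and obtain

$$\ln \pi(A) \leq 0.5 \ln \phi(A) - \lambda u. \tag{7}$$

Then take $A = [R - \inf R > \gamma]$, and note that

$$\phi(A) = \frac{\int e^{-\lambda (R - \inf R)} I(A)\, d\kappa}{\int e^{-\lambda (R - \inf R)}\, d\kappa} = \int e^{-\lambda (R - \inf R - r)} I(A)\, d\kappa \leq e^{-\lambda (\gamma - r)}.$$

Then applying this upper bound of $\phi(A)$ to (7) leads to the proof. Q.E.D.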

15 Therefore, for any probability measure $p$ and event $B$, we have that the total variation
distance $|p - p(\cdot|B)|_{TV} = p(B^c)$.

16 We note that the very general relations (6) and (7) appearing in the proof below are
assumption free and simple, and may be of interest themselves in bounding the
quasi-posterior by its limiting version.